Give a brief description of your project and its goal(s), what data you are using to complete it, and what three faculty/staff in different fields you have spoken to about your project with a brief summary of what you learned from each person. Include a link to your final project GitHub repository.
This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore these relationships, I use daily site-level monitoring data for small particulate matter (PM2.5) from the EPA’s Air Quality System for 2019, linked with census data for the tract surrounding each testing site. I am still determining how best to incorporate data on the near-term health effects of poor air quality. My main goal, however, is not to use air quality to predict those health outcomes, but to determine whether community characteristics predict air quality differently (e.g., in extent or significance) than those same characteristics predict the prevalence of events like emergency department presentations for asthma exacerbation. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; on potential data sources and analytic approaches; and on the ways in which “natural experiments” around major changes in pollutant generation have been used to compare the effects of perturbation from a steady state. Materials for this project can be found in this GitHub repository.
Describe the problem addressed, its significance, and some background to motivate the problem. Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff.
Poor air quality can have a substantial negative influence on health and quality of life, but different people face different resources, incentives, and tradeoffs in deciding where to live. We might therefore expect the relationship between who lives where and air quality to be more complex than simply people with more wealth, income, and social power concentrating in places with better air quality. By considering many different community characteristics in building models to predict poor air quality, we can learn which factors are most strongly associated with air quality, and how successfully models trained on community characteristics can predict poor air quality and related health events.
Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.
I begin by defining a function to create a data frame from the results of a GET call to the EPA’s Air Quality System API. I create objects with relevant reference lists and define a vector for the “lower 48” U.S. states and Washington, D.C. for later use.
# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key <- Sys.getenv("EPA_AQS_KEY")
# Function to create a data frame based on an API call
download.AQS <- function(whichData,customParams) {
rootpath <- "https://aqs.epa.gov/data/api/"
morepath <- paste0(rootpath,whichData)
fullParams <- append(list(email = api.email, key = api.key),customParams)
step1.json <- httr::GET(url = morepath, query = fullParams)
step2.parsed <- jsonlite::fromJSON(httr::content(step1.json,"text"), simplifyVector = FALSE)
step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(.,Data)
return(step3.df)
}
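The reshaping step inside download.AQS can be illustrated on a toy stand-in for the parsed JSON (the example data are made up; tibble and tidyr are assumed to be installed): unnest_wider spreads a list-column of named lists into one column per field.

```r
library(tibble)
library(tidyr)
# Toy stand-in for step2.parsed$Data: one named list per record
parsed_data <- list(list(code = "01", value_represented = "Alabama"),
                    list(code = "04", value_represented = "Arizona"))
# unnest_wider() turns the list-column into code / value_represented columns
toy_df <- unnest_wider(tibble(Data = parsed_data), Data)
toy_df
```

This mirrors step3.df in the function above, just without the live API call.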
# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())
# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
# Vector of lower-48 + D.C. state FIPS codes for the loops below
lower48.codes <- list.states.lower48$code
list.states.lower48
## # A tibble: 49 x 2
## code value_represented
## <chr> <chr>
## 1 01 Alabama
## 2 04 Arizona
## 3 05 Arkansas
## 4 06 California
## 5 08 Colorado
## 6 09 Connecticut
## 7 10 Delaware
## 8 11 District Of Columbia
## 9 12 Florida
## 10 13 Georgia
## # … with 39 more rows
To manage downloading daily PM2.5 data for each state for all of 2019, I define another function to loop through the state FIPS codes for given date parameters. Executing this step with 1 call per month of data takes roughly 90 minutes, so I save the resulting data frame and load it in the chunk below.
# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b,e){
collector <- data.frame()
for (s in seq_along(lower48.codes)) {
newstate <- download.AQS("dailyData/byState", list(param="88101", bdate=b, edate=e, state=lower48.codes[s]))
collector <- bind_rows(collector,newstate)
}
return(collector)
}
# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
,dailyPM2.5_201902
,dailyPM2.5_201903
,dailyPM2.5_201904
,dailyPM2.5_201905
,dailyPM2.5_201906
,dailyPM2.5_201907
,dailyPM2.5_201908
,dailyPM2.5_201909
,dailyPM2.5_201910
,dailyPM2.5_201911
,dailyPM2.5_201912)
# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")
I test combinations of testing site location identifiers and add a single site_id variable to the 2019 PM2.5 data set by concatenating location identifiers. I also create a data frame with one observation per testing site.
# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)
## 'data.frame': 1448813 obs. of 31 variables:
## $ state_code : chr "01" "01" "01" "01" ...
## $ county_code : chr "027" "027" "027" "027" ...
## $ site_number : chr "0001" "0001" "0001" "0001" ...
## $ parameter_code : chr "88101" "88101" "88101" "88101" ...
## $ poc : int 1 1 1 1 1 1 1 1 1 1 ...
## $ latitude : num 33.3 33.3 33.3 33.3 33.3 ...
## $ longitude : num -85.8 -85.8 -85.8 -85.8 -85.8 ...
## $ datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
## $ parameter : chr "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
## $ sample_duration : chr "24 HOUR" "24 HOUR" "24 HOUR" "24 HOUR" ...
## $ pollutant_standard : chr "PM25 24-hour 2006" "PM25 Annual 2006" "PM25 24-hour 2012" "PM25 Annual 2012" ...
## $ date_local : chr "2019-01-03" "2019-01-03" "2019-01-03" "2019-01-03" ...
## $ units_of_measure : chr "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" "Micrograms/cubic meter (LC)" ...
## $ event_type : chr "None" "None" "None" "None" ...
## $ observation_count : int 1 1 1 1 1 1 1 1 1 1 ...
## $ observation_percent: num 100 100 100 100 100 100 100 100 100 100 ...
## $ validity_indicator : chr "Y" "Y" "Y" "Y" ...
## $ arithmetic_mean : num 2.9 2.9 2.9 2.9 2.7 2.7 2.7 2.7 6.6 6.6 ...
## $ first_max_value : num 2.9 2.9 2.9 2.9 2.7 2.7 2.7 2.7 6.6 6.6 ...
## $ first_max_hour : int 0 0 0 0 0 0 0 0 0 0 ...
## $ aqi : int 12 12 12 12 11 11 11 11 28 28 ...
## $ method_code : chr "145" "145" "145" "145" ...
## $ method : chr "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" "R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC - Gravimetric" ...
## $ local_site_name : chr "ASHLAND" "ASHLAND" "ASHLAND" "ASHLAND" ...
## $ site_address : chr "ASHLAND AIRPORT" "ASHLAND AIRPORT" "ASHLAND AIRPORT" "ASHLAND AIRPORT" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county : chr "Clay" "Clay" "Clay" "Clay" ...
## $ city : chr "Ashland" "Ashland" "Ashland" "Ashland" ...
## $ date_of_last_change: chr "2020-03-10" "2020-03-10" "2020-03-10" "2020-03-10" ...
## $ cbsa_code : chr NA NA NA NA ...
## $ cbsa : chr NA NA NA NA ...
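The save-then-reload pattern used around the 90-minute download can be sketched generically with a temporary file (base R only; mtcars stands in for the downloaded data, and the guard means the expensive step runs only when no cached file exists):

```r
# Guarded save/load: run the slow step once, then read from disk thereafter
cache_path <- tempfile(fileext = ".RDS")
if (!file.exists(cache_path)) {
  saveRDS(mtcars, file = cache_path)  # stand-in for the slow API download
}
reloaded <- readRDS(cache_path)
identical(reloaded, mtcars)  # saveRDS/readRDS round-trips the object exactly
```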
# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
dplyr::select(state_code, state,
county_code, county,
city, cbsa_code, cbsa,
site_number, local_site_name, site_address,
latitude, longitude) %>%
unique() %>%
arrange(state_code,county_code,site_number)
dim(testing.sites)
## [1] 965 12
# 965 combinations
# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
## [1] 253 1
## [1] 770 2
## [1] 965 3
# Only 253 unique site numbers. 770 state-site combos.
# State + county + site number *is* unique (965 combos)
clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
mutate(site_id = paste0(state_code,county_code,site_number))
Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I loop through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming, so I save the resulting file and load it in the next step.
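The ID-uniqueness checks above can be reproduced on a toy example (synthetic codes, base R only): site numbers repeat across counties, but the concatenated state + county + site string identifies each monitor uniquely.

```r
# Synthetic monitors: the same site_number appears in three different places
toy_sites <- data.frame(state_code  = c("01", "01", "04"),
                        county_code = c("027", "055", "027"),
                        site_number = c("0001", "0001", "0001"))
# Concatenate the three codes into a single site_id, as in the real data
toy_sites$site_id <- paste0(toy_sites$state_code,
                            toy_sites$county_code,
                            toy_sites$site_number)
length(unique(toy_sites$site_number))  # 1: site numbers alone collide
length(unique(toy_sites$site_id))      # 3: full IDs are unique
```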
# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)
# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
newstate <- lower48.codes[s]
newtracts <- tracts(newstate)
newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left=FALSE)
if (s==1){
testing.sites.tracts <- newjoin
} else {
testing.sites.tracts <- bind_rows(testing.sites.tracts,newjoin)
}
}
# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")
From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey. After reviewing the concepts from those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.
## Simple feature collection with 6 features and 22 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
## STATEFP COUNTYFP TRACTCE GEOID NAME NAMELSAD MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON state_code state county_code county city cbsa_code cbsa site_number local_site_name site_address geometry
## ...1 01 055 001600 01055001600 16 Census Tract 16 G5020 S 10464679 1540235 +33.9780622 -085.9775275 01 Alabama 055 Etowah Gadsden 23460 Gadsden, AL 0010 GADSDEN C. COLLEGE 1001 WALLACE DRIVE, GADSDEN, AL 35902 MULTIPOLYGON (((-85.99934 3...
## ...2 01 097 005300 01097005300 53 Census Tract 53 G5020 S 4987328 294542 +30.7789148 -088.0797620 01 Alabama 097 Mobile Chickasaw 33660 Mobile, AL 0003 CHICKASAW Iroquois and Azalea, CHICKASAW, MOBILE CO., ALABAMA MULTIPOLYGON (((-88.10077 3...
## ...3 01 003 011102 01003011102 111.02 Census Tract 111.02 G5020 S 44204283 813845 +30.4799296 -087.8330813 01 Alabama 003 Baldwin Fairhope 19300 Daphne-Fairhope-Foley, AL 0010 FAIRHOPE, Alabama FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE, ALABAMA MULTIPOLYGON (((-87.88641 3...
## ...4 01 101 002500 01101002500 25 Census Tract 25 G5020 S 4488073 37253 +32.4177882 -086.2722538 01 Alabama 101 Montgomery Montgomery 33860 Montgomery, AL 1002 MOMS, ADEM 1350 COLISEUM BLVD, MONTGOMERY, ALABAMA MULTIPOLYGON (((-86.28735 3...
## ...5 01 027 959000 01027959000 9590 Census Tract 9590 G5020 S 179658298 1129596 +33.3093759 -085.8820883 01 Alabama 027 Clay Ashland <NA> <NA> 0001 ASHLAND ASHLAND AIRPORT MULTIPOLYGON (((-86.01156 3...
## ...6 01 113 030800 01113030800 308 Census Tract 308 G5020 S 21342370 417245 +32.4345151 -084.9805005 01 Alabama 113 Russell Phenix City 17980 Columbus, GA-AL 0003 PHENIX CITY - SOUTH GIRARD SCHOOL 510 6th Place South, Phenix City, Alabama 36869 MULTIPOLYGON (((-85.01535 3...
# Load a data frame with all 5-year ACS variables
census.1yr.vars <- load_variables(dataset = "acs5",year = 2018)
head(census.1yr.vars)
## # A tibble: 6 x 3
## name label concept
## <chr> <chr> <chr>
## 1 B00001_001 Estimate!!Total UNWEIGHTED SAMPLE COUNT OF THE POPULATION
## 2 B00002_001 Estimate!!Total UNWEIGHTED SAMPLE HOUSING UNITS
## 3 B01001_001 Estimate!!Total SEX BY AGE
## 4 B01001_002 Estimate!!Total!!Male SEX BY AGE
## 5 B01001_003 Estimate!!Total!!Male!!Under 5 years SEX BY AGE
## 6 B01001_004 Estimate!!Total!!Male!!5 to 9 years SEX BY AGE
# Simplify the variable list to make it easier to review the major concepts represented
census.1yr.concept.counts <-
census.1yr.vars %>%
mutate(root.concept = str_split_fixed(concept," BY",2)[,1]) %>%
mutate(root.concept = str_split_fixed(root.concept," \\(",2)[,1]) %>%
# Double escape needed to match open parenthesis!
mutate(root.name = substring(name,1,6)) %>%
group_by(root.name,root.concept) %>%
summarise(obs=n(), .groups="keep") %>%
arrange(root.name,-obs)
# View(sort(unique(census.1yr.concept.counts$root.concept)))
# After reviewing the variable list, create an object with the ones of interest
final.acs.var.list <- census.1yr.vars %>% filter(name %in% c(
"B01002_001" # median age
,"B03002_003","B03002_001" # Non-Hisp White alone // Denom
,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
,"B06008_003","B06008_001" # Now married // Denom
,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
,"B21001_002","B21001_001" # Veteran // Denom
,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
,"B22008_001" # Median household income, past 12mo
,"B25008_003","B25008_001" # Renter occupied, Denom
,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
,"B25064_001" # Median gross rent
,"B25088_002" # Median monthly owner costs for households with a mortgage
,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
,"B08006_003","B08006_001" # Drove to work alone // Denom
,"B08013_001" # Aggregate travel time to work in minutes
)) %>% arrange(name)
final.acs.var.list
## # A tibble: 38 x 3
## name label concept
## <chr> <chr> <chr>
## 1 B01002_001 Estimate!!Median age --!!Total MEDIAN AGE BY SEX
## 2 B02001_001 Estimate!!Total RACE
## 3 B02009_001 Estimate!!Total BLACK OR AFRICAN AMERICAN ALONE OR IN COMBINATION WITH ONE OR MORE OTHER RACES
## 4 B03002_001 Estimate!!Total HISPANIC OR LATINO ORIGIN BY RACE
## 5 B03002_003 Estimate!!Total!!Not Hispanic or Latino!!White alone HISPANIC OR LATINO ORIGIN BY RACE
## 6 B06008_001 Estimate!!Total PLACE OF BIRTH BY MARITAL STATUS IN THE UNITED STATES
## 7 B06008_003 Estimate!!Total!!Now married, except separated PLACE OF BIRTH BY MARITAL STATUS IN THE UNITED STATES
## 8 B08006_001 Estimate!!Total SEX OF WORKERS BY MEANS OF TRANSPORTATION TO WORK
## 9 B08006_003 Estimate!!Total!!Car, truck, or van!!Drove alone SEX OF WORKERS BY MEANS OF TRANSPORTATION TO WORK
## 10 B08013_001 Estimate!!Aggregate travel time to work (in minutes) AGGREGATE TRAVEL TIME TO WORK (IN MINUTES) OF WORKERS BY SEX
## # … with 28 more rows
Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving the data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.
head(testing.sites.tracts)
# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>%
mutate(site_id = paste0(state_code,county_code,site_number)) %>%
dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL
# Loop through existing list of state codes
lower48.codes
# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame
acs.loop <- function(){
for (st in seq_along(lower48.codes)) {
newstate <- lower48.codes[st]
county.list <- tst.minimal %>%
filter(STATEFP==newstate) %>%
dplyr::select(COUNTYFP) %>%
arrange(COUNTYFP) %>%
as.matrix() %>% as.vector() %>% as.list()
results.acs <- get_acs( geography = "tract"
,variables=as.list(final.acs.var.list$name)
,state = newstate
,county = county.list
,output = "wide")
smaller.acs <- results.acs %>% dplyr::select(GEOID,matches(".*_.*E$"))
linked.acs <- tst.minimal %>%
filter(STATEFP==newstate) %>%
left_join(.,smaller.acs,by="GEOID")
if (st==1){
collect <- linked.acs
} else {
collect <- bind_rows(collect,linked.acs)
}
}
return(collect)
}
testing.sites.acs <- acs.loop()
testing.sites.acs
saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")
With all the ACS data I need, I can create clean versions of the variables to be included in the analysis. I exclude tracts that do not have complete data for all 19 ACS-based variables or with fewer than 500 respondents in the sample (using the denominator from the set of race variables).
## site_id GEOID STATEFP COUNTYFP B01002_001E B02001_001E B02009_001E B03002_001E B03002_003E B06008_001E B06008_003E B08006_001E B08006_003E B08013_001E B08126_001E B08126_002E B08126_004E B15003_001E B15003_022E B15003_023E B15003_024E B15003_025E B17001_001E B17001_002E B21001_001E B21001_002E B22008_001E B22010_001E B22010_002E B23007_001E B23007_002E B23022_002E B23022_003E
## 1 010550010 01055001600 01 055 37.1 3300 2264 3300 1011 2838 800 1303 1127 23920 1303 23 448 2295 219 110 60 25 3173 910 2724 273 36250 1232 156 653 200 1094 710
## 2 010970003 01097005300 01 097 34.2 1970 746 1970 1129 1410 472 639 564 12990 639 0 88 1205 120 60 14 0 1970 541 1319 120 32829 756 173 457 225 440 303
## 3 010030010 01003011102 01 003 43.0 4385 95 4385 3907 3479 2120 1953 1637 46550 1953 0 186 2936 762 260 117 8 4385 317 3215 343 64851 1633 104 1219 539 1191 907
## 4 011011002 01101002500 01 101 35.3 2317 1538 2317 500 1753 584 825 765 16105 825 0 139 1455 176 91 6 11 2304 567 1647 108 34109 852 292 586 283 653 482
## 5 010270001 01027959000 01 027 47.1 2975 506 2975 2141 2485 1044 1066 969 21705 1066 0 337 1988 174 49 20 22 2814 1053 2417 179 26875 1240 216 778 247 792 530
## 6 011130003 01113030800 01 113 39.5 3254 3024 3254 212 2755 583 1103 941 25900 1103 0 188 2066 59 30 0 0 3159 1425 2566 249 21017 1329 381 602 155 944 544
## B23022_026E B23022_027E B25008_001E B25008_003E B25024_001E B25024_002E B25024_003E B25064_001E B25088_002E
## 1 1170 735 3173 911 1650 1522 24 841 840
## 2 556 307 1970 895 896 627 46 598 1052
## 3 1409 1058 4385 328 1880 1555 19 937 1381
## 4 850 521 2317 1263 1122 969 42 780 856
## 5 971 602 2838 879 1569 1051 7 394 971
## 6 1258 734 3249 1509 1599 1126 21 604 944
# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
dplyr::rename( age.median = B01002_001E
,income.hh.median = B22008_001E
,rent.median = B25064_001E
,home.pmt.median = B25088_002E
,commute.time.agg = B08013_001E) %>%
mutate( race.black.any = 100*B02009_001E/B02001_001E
,ethn.nhisp.white.alone = 100*B03002_003E/B03002_001E
,married = 100*B06008_003E/B06008_001E
,kids = 100*B23007_002E/B23007_001E
,bachelor.plus = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
,veteran = 100*B21001_002E/B21001_001E
,manufacturing = 100*B08126_004E/B08126_001E
,farm.fish.mining = 100*B08126_002E/B08126_001E
,commutes.car.alone = 100*B08006_003E/B08006_001E
,renter = 100*B25008_003E/B25008_001E
,single.fam.home = 100*(B25024_002E + B25024_003E)/B25024_001E
,worked.12mo = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E)
,poverty.12mo = 100*B17001_002E/B17001_001E
,snap.12mo = 100*B22010_002E/B22010_001E
,denominator = B02001_001E) %>%
dplyr::select(-matches("^B[012].*"))
head(testing.sites.acs.ready)
## site_id GEOID STATEFP COUNTYFP age.median commute.time.agg income.hh.median rent.median home.pmt.median race.black.any ethn.nhisp.white.alone married kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.alone renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
## 1 010550010 01055001600 01 055 37.1 23920 36250 841 840 68.606061 30.636364 28.18887 30.62787 18.039216 10.022026 34.38219 1.765157 86.49271 28.710999 93.69697 63.82509 28.67948 12.662338 3300
## 2 010970003 01097005300 01 097 34.2 12990 32829 598 1052 37.868020 57.309645 33.47518 49.23414 16.099585 9.097801 13.77152 0.000000 88.26291 45.431472 75.11161 61.24498 27.46193 22.883598 1970
## 3 010030010 01003011102 01 003 43.0 46550 64851 937 1381 2.166477 89.099202 60.93705 44.21657 39.066757 10.668740 9.52381 0.000000 83.81976 7.480046 83.72340 75.57692 7.22919 6.368647 4385
## 4 011011002 01101002500 01 101 35.3 16105 34109 780 856 66.378938 21.579629 33.31432 48.29352 19.518900 6.557377 16.84848 0.000000 92.72727 54.510142 90.10695 66.73320 24.60938 34.272300 2317
## 5 010270001 01027959000 01 027 47.1 21705 26875 394 971 17.008403 71.966387 42.01207 31.74807 13.329980 7.405875 31.61351 0.000000 90.90056 30.972516 67.43149 64.20874 37.42004 17.419355 2975
## 6 011130003 01113030800 01 113 39.5 25900 21017 604 944 92.931776 6.515058 21.16152 25.74751 4.307841 9.703819 17.04442 0.000000 85.31278 46.445060 71.73233 58.03815 45.10921 28.668172 3254
## site_id GEOID STATEFP COUNTYFP age.median commute.time.agg income.hh.median rent.median home.pmt.median race.black.any ethn.nhisp.white.alone married kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.alone renter single.fam.home worked.12mo poverty.12mo
## Length:965 Length:965 Length:965 Length:965 Min. :19.20 Min. : 2175 Min. : 6953 Min. : 271.0 Min. : 541 Min. : 0.000 Min. : 0.00 Min. : 1.076 Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 3.144 Min. : 0.00 Min. : 0.00 Min. : 9.004 Min. : 0.6641
## Class :character Class :character Class :character Class :character 1st Qu.:32.70 1st Qu.: 25162 1st Qu.: 35088 1st Qu.: 706.0 1st Qu.:1030 1st Qu.: 1.767 1st Qu.:36.90 1st Qu.:31.764 1st Qu.: 35.50 1st Qu.:13.25 1st Qu.: 4.688 1st Qu.: 5.119 1st Qu.: 0.0000 1st Qu.:69.001 1st Qu.: 24.17 1st Qu.: 51.07 1st Qu.:68.488 1st Qu.: 9.6620
## Mode :character Mode :character Mode :character Mode :character Median :37.20 Median : 37895 Median : 48700 Median : 863.0 Median :1282 Median : 5.738 Median :66.52 Median :42.238 Median : 42.71 Median :21.14 Median : 7.185 Median : 9.038 Median : 0.7683 Median :77.513 Median : 39.78 Median : 68.73 Median :75.509 Median :16.6063
## Mean :37.75 Mean : 45979 Mean : 52509 Mean : 942.3 Mean :1412 Mean :16.031 Mean :59.65 Mean :41.888 Mean : 43.10 Mean :25.23 Mean : 7.422 Mean :10.364 Mean : 3.0371 Mean :74.382 Mean : 42.77 Mean : 63.96 Mean :74.135 Mean :19.7553
## 3rd Qu.:42.50 3rd Qu.: 58158 3rd Qu.: 64558 3rd Qu.:1075.0 3rd Qu.:1621 3rd Qu.:19.825 3rd Qu.:85.59 3rd Qu.:53.168 3rd Qu.: 50.56 3rd Qu.:33.17 3rd Qu.: 9.661 3rd Qu.:14.093 3rd Qu.: 2.8991 3rd Qu.:83.803 3rd Qu.: 58.57 3rd Qu.: 81.32 3rd Qu.:81.789 3rd Qu.:27.9066
## Max. :70.00 Max. :305225 Max. :193239 Max. :3017.0 Max. :4001 Max. :99.358 Max. :98.92 Max. :74.400 Max. :100.00 Max. :88.74 Max. :34.728 Max. :47.056 Max. :65.6162 Max. :95.970 Max. :100.00 Max. :100.00 Max. :95.032 Max. :73.4463
## NA's :2 NA's :7 NA's :6 NA's :12 NA's :27 NA's :2 NA's :2 NA's :2 NA's :5 NA's :2 NA's :2 NA's :5 NA's :5 NA's :5 NA's :5 NA's :5 NA's :2 NA's :5
## snap.12mo denominator
## Min. : 0.00 Min. : 0
## 1st Qu.: 6.49 1st Qu.: 2853
## Median :13.87 Median : 4145
## Mean :17.09 Mean : 4521
## 3rd Qu.:24.60 3rd Qu.: 5646
## Max. :72.19 Max. :29636
## NA's :5
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()
length(unique(testing.sites.acs.ready$GEOID))
## [1] 953
# 953 unique tracts for 965 testing sites - GEOID is not unique, so some tracts contain multiple sites
# 34 tracts to exclude based on incomplete data and total denominator
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <-
testing.sites.acs.ready %>%
mutate(commute.time.agg = commute.time.agg/10000
,income.hh.median = income.hh.median/10000
,rent.median = rent.median/100
,home.pmt.median = home.pmt.median/100
) %>%
dplyr::rename(commute.time.agg.10k = commute.time.agg
,income.hh.median.10k = income.hh.median
,rent.median.100 = rent.median
,home.pmt.median.100 = home.pmt.median) %>%
dplyr::filter(denominator >= 500 & complete.cases(.))
summary(testing.sites.acs.complete)
## site_id GEOID STATEFP COUNTYFP age.median commute.time.agg.10k income.hh.median.10k rent.median.100 home.pmt.median.100 race.black.any ethn.nhisp.white.alone married kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.alone renter single.fam.home worked.12mo
## Length:931 Length:931 Length:931 Length:931 Min. :19.60 Min. : 0.2175 Min. : 1.060 Min. : 2.710 Min. : 5.41 Min. : 0.000 Min. : 0.00 Min. : 3.058 Min. : 8.772 Min. : 0.4713 Min. : 0.000 Min. : 0.000 Min. : 0.0000 Min. : 3.144 Min. : 1.335 Min. : 0.00 Min. :26.86
## Class :character Class :character Class :character Class :character 1st Qu.:32.80 1st Qu.: 2.5518 1st Qu.: 3.572 1st Qu.: 7.115 1st Qu.:10.26 1st Qu.: 1.729 1st Qu.:38.07 1st Qu.:32.269 1st Qu.:35.491 1st Qu.:13.4215 1st Qu.: 4.801 1st Qu.: 5.259 1st Qu.: 0.0000 1st Qu.:69.402 1st Qu.:24.272 1st Qu.: 51.92 1st Qu.:68.82
## Mode :character Mode :character Mode :character Mode :character Median :37.20 Median : 3.8400 Median : 4.924 Median : 8.650 Median :12.79 Median : 5.467 Median :67.88 Median :42.618 Median :42.688 Median :21.2064 Median : 7.232 Median : 9.166 Median : 0.7833 Median :77.762 Median :39.391 Median : 68.99 Median :75.65
## Mean :37.81 Mean : 4.6502 Mean : 5.260 Mean : 9.431 Mean :14.04 Mean :15.471 Mean :60.27 Mean :42.350 Mean :42.955 Mean :25.3015 Mean : 7.424 Mean :10.435 Mean : 3.0807 Mean :74.849 Mean :42.076 Mean : 64.48 Mean :74.51
## 3rd Qu.:42.55 3rd Qu.: 5.8768 3rd Qu.: 6.447 3rd Qu.:10.730 3rd Qu.:16.12 3rd Qu.:19.079 3rd Qu.:85.73 3rd Qu.:53.228 3rd Qu.:50.442 3rd Qu.:33.0601 3rd Qu.: 9.677 3rd Qu.:14.095 3rd Qu.: 3.0265 3rd Qu.:83.937 3rd Qu.:57.313 3rd Qu.: 81.31 3rd Qu.:81.82
## Max. :70.00 Max. :30.5225 Max. :18.531 Max. :30.170 Max. :40.01 Max. :99.358 Max. :98.92 Max. :73.157 Max. :80.645 Max. :88.7425 Max. :31.705 Max. :47.056 Max. :65.6162 Max. :95.970 Max. :97.086 Max. :100.00 Max. :95.03
## poverty.12mo snap.12mo denominator
## Min. : 0.6641 Min. : 0.000 Min. : 635
## 1st Qu.: 9.7102 1st Qu.: 6.578 1st Qu.: 2884
## Median :16.5067 Median :13.905 Median : 4167
## Mean :19.4242 Mean :16.916 Mean : 4579
## 3rd Qu.:27.1693 3rd Qu.:24.274 3rd Qu.: 5684
## Max. :73.4463 Max. :72.193 Max. :29636
## [1] 920
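The exclusion rule above combines the respondent threshold with complete.cases; here is a minimal base-R version of the same logic on made-up rows (column names chosen to echo the real data):

```r
# Toy version of the exclusion filter: drop rows with any NA
# or with fewer than 500 respondents in the denominator
toy_acs <- data.frame(denominator  = c(2000, 450, 3000),
                      poverty.12mo = c(12.5, 8.0, NA))
kept <- toy_acs[toy_acs$denominator >= 500 & complete.cases(toy_acs), ]
nrow(kept)  # 1: only the first row passes both conditions
```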
I join the census tract GEOIDs to the daily PM2.5 data, and clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.
## state_code county_code site_number parameter_code poc latitude longitude datum parameter sample_duration pollutant_standard date_local units_of_measure event_type observation_count observation_percent validity_indicator arithmetic_mean first_max_value first_max_hour aqi
## Length:1448813 Length:1448813 Length:1448813 Length:1448813 Min. : 1.000 Min. :25.47 Min. :-124.18 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Min. : 1.000 Min. : 4.00 Length:1448813 Min. : -5.200 Min. : -5.000 Min. : 0.00 Min. : 0.00
## Class :character Class :character Class :character Class :character 1st Qu.: 1.000 1st Qu.:35.73 1st Qu.:-109.56 Class :character Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.: 1.000 1st Qu.:100.00 Class :character 1st Qu.: 4.300 1st Qu.: 4.600 1st Qu.: 0.00 1st Qu.: 18.00
## Mode :character Mode :character Mode :character Mode :character Median : 3.000 Median :39.58 Median : -90.20 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median : 1.000 Median :100.00 Mode :character Median : 6.500 Median : 7.100 Median : 0.00 Median : 27.00
## Mean : 2.718 Mean :38.90 Mean : -94.36 Mean : 4.656 Mean : 99.72 Mean : 7.443 Mean : 8.686 Mean : 1.86 Mean : 29.97
## 3rd Qu.: 3.000 3rd Qu.:41.77 3rd Qu.: -81.24 3rd Qu.: 1.000 3rd Qu.:100.00 3rd Qu.: 9.500 3rd Qu.: 10.900 3rd Qu.: 0.00 3rd Qu.: 40.00
## Max. :33.000 Max. :48.76 Max. : -68.02 Max. :24.000 Max. :100.00 Max. :113.083 Max. :896.000 Max. :23.00 Max. :181.00
## NA's :234562
## method_code method local_site_name site_address state county city date_of_last_change cbsa_code cbsa site_id
## Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813 Length:1448813
## Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
##
##
##
##
# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
dplyr::select(site_id,GEOID) %>%
unique()
GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
,clean.2019.dailyPM2.5
,by="site_id")
dim(GEOID.dailyPM2.5)
## [1] 1396734 33
## [1] 243809 2
##
## Excluded Included None <NA>
## 2200 55291 1339243 0
##
## N Y <NA>
## 4282 1392452 0
##
## Excluded Included None <NA>
## N 3 221 4058 0
## Y 2197 55070 1335185 0
## <NA> 0 0 0 0
# Y, N. 4282 x N
# Excluding validity_indicator == "N" does not get rid of a meaningful % of events
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()
summary(GEOID.dailyPM2.5$arithmetic_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.200 4.292 6.463 7.418 9.500 113.083
# Create preliminary outcome variables
GEOID.daily.summary <-
GEOID.dailyPM2.5 %>%
dplyr::filter(validity_indicator=="Y") %>%
mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
group_by(GEOID,date_local) %>%
summarise(mean = mean(arithmetic_mean),.groups = "keep")
# table(GEOID.daily.summary$any_event)
summary(GEOID.daily.summary$mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.218 6.407 7.354 9.403 113.017
GEOID.AQI.outcomes <-
GEOID.daily.summary %>%
mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>% # 12.0 ug/m3 is the upper bound of the "Good" (green) PM2.5 AQI category
group_by(GEOID) %>%
summarise( pm2.5_days = n()
,pm2.5_p50 = quantile(mean, 0.50)
,pm2.5_ungreen_days = sum(ungreen)
,.groups="keep")
head(GEOID.AQI.outcomes)
## # A tibble: 6 x 4
## # Groups: GEOID [6]
## GEOID pm2.5_days pm2.5_p50 pm2.5_ungreen_days
## <chr> <int> <dbl> <int>
## 1 01003011102 107 7.3 8
## 2 01027959000 107 7.3 10
## 3 01033020701 68 6.75 4
## 4 01049960700 108 7.15 5
## 5 01055001600 114 7.9 15
## 6 01069040600 115 7.5 16
## GEOID pm2.5_days pm2.5_p50 pm2.5_ungreen_days
## Length:920 Min. : 6.0 Min. : 0.2236 Min. : 0.00
## Class :character 1st Qu.:119.0 1st Qu.: 5.5000 1st Qu.: 11.00
## Mode :character Median :341.0 Median : 6.7048 Median : 23.00
## Mean :261.5 Mean : 6.6179 Mean : 33.54
## 3rd Qu.:358.0 3rd Qu.: 7.7808 3rd Qu.: 49.00
## Max. :365.0 Max. :13.6167 Max. :227.00
##
## (distribution of pm2.5_days per tract, condensed: coverage ranges from 6 to
##  365 days of valid PM2.5 data; 56 tracts have a reading on all 365 days)
# Create final PM2.5 outcome variables
AQI.outcomes.final <-
GEOID.AQI.outcomes %>%
dplyr::filter(pm2.5_days >= 50) %>%
mutate( pm2.5_pct_ungreen = 100 * pm2.5_ungreen_days / pm2.5_days
,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days * 5 > pm2.5_days ~ 1L, TRUE ~ 0L)
) %>%
mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
,levels = c(0,1)
,labels = c("0-10%",">10%"))
,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
,levels = c(0,1)
,labels = c("0-20%",">20%"))) %>%
dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)
# Review created outcome variables
head(AQI.outcomes.final)
## # A tibble: 6 x 5
## # Groups: GEOID [6]
## GEOID pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
## <chr> <dbl> <dbl> <fct> <fct>
## 1 01003011102 7.3 7.48 0-10% 0-20%
## 2 01027959000 7.3 9.35 0-10% 0-20%
## 3 01033020701 6.75 5.88 0-10% 0-20%
## 4 01049960700 7.15 4.63 0-10% 0-20%
## 5 01055001600 7.9 13.2 >10% 0-20%
## 6 01069040600 7.5 13.9 >10% 0-20%
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2236 5.5000 6.7071 6.6228 7.7810 13.6167
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.882 12.045 13.120 18.103 62.192
##
## 0-10% >10%
## 372 537
##
## 0-20% >20%
## 741 168
## GEOID pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
## Length:537 Min. : 3.710 Min. :10.06 0-10%: 0 0-20%:369
## Class :character 1st Qu.: 6.707 1st Qu.:13.22 >10% :537 >20% :168
## Mode :character Median : 7.562 Median :16.67
## Mean : 7.620 Mean :18.82
## 3rd Qu.: 8.434 3rd Qu.:21.78
## Max. :13.617 Max. :62.19
## GEOID pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen
## Length:168 Min. : 4.800 Min. :20.11 0-10%: 0 0-20%: 0
## Class :character 1st Qu.: 8.280 1st Qu.:22.11 >10% :168 >20% :168
## Mode :character Median : 8.801 Median :25.79
## Mean : 8.893 Mean :27.81
## 3rd Qu.: 9.500 3rd Qu.:30.73
## Max. :13.617 Max. :62.19
With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.
## tibble [909 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ GEOID : chr [1:909] "01003011102" "01027959000" "01033020701" "01049960700" ...
## $ pm2.5_p50 : Named num [1:909] 7.3 7.3 6.75 7.15 7.9 ...
## ..- attr(*, "names")= chr [1:909] "50%" "50%" "50%" "50%" ...
## $ pm2.5_pct_ungreen : num [1:909] 7.48 9.35 5.88 4.63 13.16 ...
## $ pm2.5_flag10_ungreen: Factor w/ 2 levels "0-10%",">10%": 1 1 1 1 2 2 2 2 2 2 ...
## $ pm2.5_flag20_ungreen: Factor w/ 2 levels "0-20%",">20%": 1 1 1 1 1 1 2 2 1 1 ...
## - attr(*, "groups")= tibble [909 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ GEOID: chr [1:909] "01003011102" "01027959000" "01033020701" "01049960700" ...
## ..$ .rows: list<int> [1:909]
## .. ..$ : int 1
## .. .. [list output truncated]
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
## 'data.frame': 931 obs. of 24 variables:
## $ site_id : chr "010550010" "010970003" "010030010" "011011002" ...
## $ GEOID : chr "01055001600" "01097005300" "01003011102" "01101002500" ...
## $ STATEFP : chr "01" "01" "01" "01" ...
## $ COUNTYFP : chr "055" "097" "003" "101" ...
## $ age.median : num 37.1 34.2 43 35.3 47.1 39.5 40.4 33.4 33.7 38.3 ...
## $ commute.time.agg.10k : num 2.39 1.3 4.66 1.61 2.17 ...
## $ income.hh.median.10k : num 3.62 3.28 6.49 3.41 2.69 ...
## $ rent.median.100 : num 8.41 5.98 9.37 7.8 3.94 6.04 8.11 7.87 6.85 9.49 ...
## $ home.pmt.median.100 : num 8.4 10.52 13.81 8.56 9.71 ...
## $ race.black.any : num 68.61 37.87 2.17 66.38 17.01 ...
## $ ethn.nhisp.white.alone: num 30.6 57.3 89.1 21.6 72 ...
## $ married : num 28.2 33.5 60.9 33.3 42 ...
## $ kids : num 30.6 49.2 44.2 48.3 31.7 ...
## $ bachelor.plus : num 18 16.1 39.1 19.5 13.3 ...
## $ veteran : num 10.02 9.1 10.67 6.56 7.41 ...
## $ manufacturing : num 34.38 13.77 9.52 16.85 31.61 ...
## $ farm.fish.mining : num 1.77 0 0 0 0 ...
## $ commutes.car.alone : num 86.5 88.3 83.8 92.7 90.9 ...
## $ renter : num 28.71 45.43 7.48 54.51 30.97 ...
## $ single.fam.home : num 93.7 75.1 83.7 90.1 67.4 ...
## $ worked.12mo : num 63.8 61.2 75.6 66.7 64.2 ...
## $ poverty.12mo : num 28.68 27.46 7.23 24.61 37.42 ...
## $ snap.12mo : num 12.66 22.88 6.37 34.27 17.42 ...
## $ denominator : num 3300 1970 4385 2317 2975 ...
# Drop unnecessary geographic variables from the ACS file
acs.joinready <-
testing.sites.acs.complete %>%
dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
unique()
dim(acs.joinready)
## [1] 920  21
## [1] 920
# 920 rows, unique on GEOID
# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
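Because `inner_join()` silently drops tracts that appear in only one file, a quick `anti_join()` check (a sketch using the objects built above) can count the non-matching tracts on each side:

```r
# Sketch only: count tracts that appear in one file but not the other
library(dplyr)

# Tracts with PM2.5 outcomes but no complete ACS data
AQI.outcomes.final %>% anti_join(acs.joinready, by = "GEOID") %>% nrow()
# Tracts with complete ACS data but too few valid PM2.5 testing days
acs.joinready %>% anti_join(AQI.outcomes.final, by = "GEOID") %>% nrow()
```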
head(AQI.ACS)
## # A tibble: 6 x 25
## # Groups: GEOID [6]
## GEOID pm2.5_p50 pm2.5_pct_ungreen pm2.5_flag10_ungreen pm2.5_flag20_ungreen age.median commute.time.agg.10k income.hh.median.10k rent.median.100 home.pmt.median.100 race.black.any ethn.nhisp.white.alone married kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.alone renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
## <chr> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01003011102 7.3 7.48 0-10% 0-20% 43 4.66 6.49 9.37 13.8 2.17 89.1 60.9 44.2 39.1 10.7 9.52 0 83.8 7.48 83.7 75.6 7.23 6.37 4385
## 2 01027959000 7.3 9.35 0-10% 0-20% 47.1 2.17 2.69 3.94 9.71 17.0 72.0 42.0 31.7 13.3 7.41 31.6 0 90.9 31.0 67.4 64.2 37.4 17.4 2975
## 3 01033020701 6.75 5.88 0-10% 0-20% 44.1 5.96 6.34 7.61 11.9 11.1 82.8 55.0 33.1 24.1 10.4 23.1 0 95.5 13.6 76.2 73.2 8.67 4.50 6224
## 4 01049960700 7.15 4.63 0-10% 0-20% 33.7 10.3 3.49 6.85 9.56 1.01 59.9 56.3 46.4 8.55 3.65 36.2 2.47 86.9 30.6 53.8 62.5 28.8 17.5 9067
## 5 01055001600 7.9 13.2 >10% 0-20% 37.1 2.39 3.62 8.41 8.4 68.6 30.6 28.2 30.6 18.0 10.0 34.4 1.77 86.5 28.7 93.7 63.8 28.7 12.7 3300
## 6 01069040600 7.5 13.9 >10% 0-20% 33.8 1.04 1.46 5.62 7.13 85.6 12.7 12.8 52.5 4.73 5.78 12.8 0.753 74.0 76.5 55.0 54.1 57.2 45.2 2021
## Classes 'sf' and 'data.frame': 965 obs. of 23 variables:
## $ STATEFP : chr "01" "01" "01" "01" ...
## $ COUNTYFP : chr "055" "097" "003" "101" ...
## $ TRACTCE : chr "001600" "005300" "011102" "002500" ...
## $ GEOID : chr "01055001600" "01097005300" "01003011102" "01101002500" ...
## $ NAME : chr "16" "53" "111.02" "25" ...
## $ NAMELSAD : chr "Census Tract 16" "Census Tract 53" "Census Tract 111.02" "Census Tract 25" ...
## $ MTFCC : chr "G5020" "G5020" "G5020" "G5020" ...
## $ FUNCSTAT : chr "S" "S" "S" "S" ...
## $ ALAND : num 1.05e+07 4.99e+06 4.42e+07 4.49e+06 1.80e+08 ...
## $ AWATER : num 1540235 294542 813845 37253 1129596 ...
## $ INTPTLAT : chr "+33.9780622" "+30.7789148" "+30.4799296" "+32.4177882" ...
## $ INTPTLON : chr "-085.9775275" "-088.0797620" "-087.8330813" "-086.2722538" ...
## $ state_code : chr "01" "01" "01" "01" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ county_code : chr "055" "097" "003" "101" ...
## $ county : chr "Etowah" "Mobile" "Baldwin" "Montgomery" ...
## $ city : chr "Gadsden" "Chickasaw" "Fairhope" "Montgomery" ...
## $ cbsa_code : chr "23460" "33660" "19300" "33860" ...
## $ cbsa : chr "Gadsden, AL" "Mobile, AL" "Daphne-Fairhope-Foley, AL" "Montgomery, AL" ...
## $ site_number : chr "0010" "0003" "0010" "1002" ...
## $ local_site_name: chr "GADSDEN C. COLLEGE" "CHICKASAW" "FAIRHOPE, Alabama" "MOMS, ADEM" ...
## $ site_address : chr "1001 WALLACE DRIVE, GADSDEN, AL 35902" "Iroquois and Azalea, CHICKASAW, MOBILE CO., ALABAMA" "FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE, ALABAMA" "1350 COLISEUM BLVD, MONTGOMERY, ALABAMA" ...
## $ geometry :sfc_GEOMETRY of length 965; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:424, 1:2] -86 -86 -86 -86 -86 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:22] "STATEFP" "COUNTYFP" "TRACTCE" "GEOID" ...
geom.joinready <-
testing.sites.tracts %>%
dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
unique()
dim(geom.joinready)
## [1] 954  10
## [1] 953
## `summarise()` ungrouping output (override with `.groups` argument)
## Simple feature collection with 6 features and 2 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -120.6506 ymin: 30.43231 xmax: -85.77036 ymax: 35.11134
## geographic CRS: NAD83
## # A tibble: 6 x 3
## GEOID obs geometry
## <chr> <int> <MULTIPOLYGON [°]>
## 1 06079012304 2 (((-120.6504 34.97578, -120.6452 34.9935, -120.6436 34.99941, -120.6433 34.99998, -1...
## 2 01003011102 1 (((-87.88641 30.47292, -87.88641 30.47597, -87.8864 30.47612, -87.8864 30.47784, -87...
## 3 01027959000 1 (((-86.01156 33.29432, -86.00613 33.2943, -86.00355 33.29429, -85.99997 33.29427, -8...
## 4 01033020701 1 (((-87.67893 34.70889, -87.67891 34.71478, -87.67891 34.71493, -87.67794 34.7148, -8...
## 5 01049960700 1 (((-86.10822 34.29816, -86.10801 34.29831, -86.10724 34.29879, -86.10644 34.29897, -...
## 6 01055001600 1 (((-85.99934 33.98564, -85.99933 33.98592, -85.99927 33.9864, -85.99918 33.98703, -8...
## Simple feature collection with 2 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -120.6506 ymin: 34.95956 xmax: -120.381 ymax: 35.11134
## geographic CRS: NAD83
## GEOID NAMELSAD STATEFP state COUNTYFP county city cbsa cbsa_code geometry
## 7797 06079012304 Census Tract 123.04 06 California 079 San Luis Obispo Nipomo San Luis Obispo-Paso Robles-Arroyo Grande, CA 42020 MULTIPOLYGON (((-120.6504 3...
## 7797.1 06079012304 Census Tract 123.04 06 California 079 San Luis Obispo Arroyo Grande San Luis Obispo-Paso Robles-Arroyo Grande, CA 42020 MULTIPOLYGON (((-120.6504 3...
# Hard-code one city value so that the city attribute is unique by GEOID
geom.joinready2 <-
geom.joinready %>%
mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
unique()
dim(geom.joinready2)
## [1] 953  10
## [1] 953
## Simple feature collection with 6 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
## GEOID NAMELSAD STATEFP state COUNTYFP county city cbsa cbsa_code geometry
## 1 01055001600 Census Tract 16 01 Alabama 055 Etowah Gadsden Gadsden, AL 23460 MULTIPOLYGON (((-85.99934 3...
## 2 01097005300 Census Tract 53 01 Alabama 097 Mobile Chickasaw Mobile, AL 33660 MULTIPOLYGON (((-88.10077 3...
## 3 01003011102 Census Tract 111.02 01 Alabama 003 Baldwin Fairhope Daphne-Fairhope-Foley, AL 19300 MULTIPOLYGON (((-87.88641 3...
## 4 01101002500 Census Tract 25 01 Alabama 101 Montgomery Montgomery Montgomery, AL 33860 MULTIPOLYGON (((-86.28735 3...
## 5 01027959000 Census Tract 9590 01 Alabama 027 Clay Ashland <NA> <NA> MULTIPOLYGON (((-86.01156 3...
## 6 01113030800 Census Tract 308 01 Alabama 113 Russell Phenix City Columbus, GA-AL 17980 MULTIPOLYGON (((-85.01535 3...
# Join geography/geometry file with AQS outcomes and ACS variables; drop grouping
AQI.ACS.geom <- inner_join(geom.joinready2,AQI.ACS,by="GEOID") %>% ungroup()
head(AQI.ACS.geom)
## Simple feature collection with 6 features and 33 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -88.10077 ymin: 30.43231 xmax: -84.96303 ymax: 34.00093
## geographic CRS: NAD83
## # A tibble: 6 x 34
## GEOID NAMELSAD STATEFP state COUNTYFP county city cbsa cbsa_code pm2.5_p50 pm2.5_pct_ungre… pm2.5_flag10_un… pm2.5_flag20_un… age.median commute.time.ag… income.hh.media… rent.median.100 home.pmt.median… race.black.any ethn.nhisp.whit… married kids bachelor.plus veteran manufacturing farm.fish.mining commutes.car.al… renter single.fam.home worked.12mo poverty.12mo snap.12mo denominator
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0105… Census … 01 Alab… 055 Etowah Gads… Gads… 23460 7.9 13.2 >10% 0-20% 37.1 2.39 3.62 8.41 8.4 68.6 30.6 28.2 30.6 18.0 10.0 34.4 1.77 86.5 28.7 93.7 63.8 28.7 12.7 3300
## 2 0109… Census … 01 Alab… 097 Mobile Chic… Mobi… 33660 8 9.82 0-10% 0-20% 34.2 1.30 3.28 5.98 10.5 37.9 57.3 33.5 49.2 16.1 9.10 13.8 0 88.3 45.4 75.1 61.2 27.5 22.9 1970
## 3 0100… Census … 01 Alab… 003 Baldw… Fair… Daph… 19300 7.3 7.48 0-10% 0-20% 43 4.66 6.49 9.37 13.8 2.17 89.1 60.9 44.2 39.1 10.7 9.52 0 83.8 7.48 83.7 75.6 7.23 6.37 4385
## 4 0110… Census … 01 Alab… 101 Montg… Mont… Mont… 33860 8.45 19.3 >10% 0-20% 35.3 1.61 3.41 7.8 8.56 66.4 21.6 33.3 48.3 19.5 6.56 16.8 0 92.7 54.5 90.1 66.7 24.6 34.3 2317
## 5 0102… Census … 01 Alab… 027 Clay Ashl… <NA> <NA> 7.3 9.35 0-10% 0-20% 47.1 2.17 2.69 3.94 9.71 17.0 72.0 42.0 31.7 13.3 7.41 31.6 0 90.9 31.0 67.4 64.2 37.4 17.4 2975
## 6 0111… Census … 01 Alab… 113 Russe… Phen… Colu… 17980 8.61 25.6 >10% >20% 39.5 2.59 2.10 6.04 9.44 92.9 6.52 21.2 25.7 4.31 9.70 17.0 0 85.3 46.4 71.7 58.0 45.1 28.7 3254
## # … with 1 more variable: geometry <MULTIPOLYGON [°]>
## [1] "sf" "tbl_df" "tbl" "data.frame"
Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.
Planned analyses: perform bivariate tests of community characteristics vs. air quality (both continuous measures and a binary indicator for poor air quality above a threshold); build multivariable models predicting air quality from community characteristics; and construct a classifier from community characteristics, evaluating its performance against the observed data.
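One way those planned analyses might be sketched (illustrative predictor choices only, using the `AQI.ACS` data frame built above):

```r
# Sketch only: variable choices are illustrative, not the final model
library(dplyr)

# Bivariate test: does median household income differ between tracts with
# >10% vs 0-10% "ungreen" days?
wilcox.test(income.hh.median.10k ~ pm2.5_flag10_ungreen, data = AQI.ACS)

# Multivariable logistic model for the same binary outcome
fit <- glm(pm2.5_flag10_ungreen ~ income.hh.median.10k + poverty.12mo +
             commutes.car.alone + manufacturing,
           data = AQI.ACS, family = binomial())
summary(fit)

# Simple in-sample check of classifier performance (a held-out split or
# cross-validation would be preferable)
pred <- ifelse(predict(fit, type = "response") > 0.5, ">10%", "0-10%")
table(predicted = pred, observed = AQI.ACS$pm2.5_flag10_ungreen)
```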
Continuous PM2.5 outcomes by tract, mapped:
AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # reproject from NAD83 (EPSG:4269) to WGS84 (EPSG:4326), which leaflet expects
palette <- colorNumeric("viridis", NULL, reverse=TRUE)
# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))
leaflet(AQI.ACS.geom.WGS84) %>%
addPolygons(fillColor = ~palette(pm2.5_p50)
,fillOpacity = 0.9
,weight = 1.5
,color = "black"
,popup = popup_msg) %>%
addTiles() %>%
addLegend("bottomright",
pal=palette,
values=~pm2.5_p50,
title = 'PM2.5',
opacity = 1) %>%
addScaleBar()